
test: add e2e test for NVIDIA device plugin as DaemonSet #7964

Merged

ganeshkumarashok merged 5 commits into main from
aganeshkumar/nvidia-device-plugin-daemonset-e2e on Mar 3, 2026

Conversation

@ganeshkumarashok
Contributor

Summary

  • Add a new e2e test that validates GPU nodes work correctly when the NVIDIA device plugin is deployed as a Kubernetes DaemonSet instead of a systemd service
  • This tests the upstream deployment model commonly used by customers who manage their own device plugin deployment

Test Details

The test Test_Ubuntu2204_NvidiaDevicePlugin_Daemonset:

  • Provisions a GPU node (Standard_NV6ads_A10_v5) with GPU drivers enabled but systemd device plugin disabled
  • Deploys nvidia-device-plugin:v0.18.2 from MCR (mcr.microsoft.com/oss/v2/nvidia/k8s-device-plugin) as a DaemonSet
  • Validates:
    • GPU resources are advertised by the device plugin
    • GPU workloads can be scheduled on the node

Test plan

  • CI pipeline runs the new GPU e2e test
  • Verify the DaemonSet-based device plugin properly registers GPU resources
  • Verify GPU workloads can be scheduled

Copilot AI left a comment

Pull request overview

This pull request adds a new e2e test (Test_Ubuntu2204_NvidiaDevicePlugin_Daemonset) that validates GPU nodes work correctly when the NVIDIA device plugin is deployed as a Kubernetes DaemonSet instead of a systemd service. This tests the upstream deployment model commonly used by customers who manage their own device plugin deployment.

Changes:

  • Adds a new GPU e2e test that provisions a Standard_NV6ads_A10_v5 GPU node with drivers enabled but systemd device plugin disabled
  • Deploys nvidia-device-plugin:v0.18.2 from MCR as a DaemonSet
  • Validates GPU resource advertisement and workload scheduling with DaemonSet-based device plugin

Copilot AI left a comment

Pull request overview

Copilot reviewed 1 out of 1 changed files in this pull request and generated 6 comments.

Comment on lines +181 to +189

```go
if pod.Status.Phase == corev1.PodRunning {
	// Check if all containers are ready
	for _, containerStatus := range pod.Status.ContainerStatuses {
		if !containerStatus.Ready {
			s.T.Logf("Container %s is not ready yet", containerStatus.Name)
			return false, nil
		}
	}
	return true, nil
```

Copilot AI Feb 26, 2026
The wait loop checks if any containers in the pod are not ready, but it only logs which container is not ready without failing immediately on errors like CrashLoopBackOff or ImagePullBackOff. If the pod enters a failed state, the test will wait the full 3 minutes before timing out.

Consider checking pod.Status.ContainerStatuses[].State.Waiting.Reason for error states (ImagePullBackOff, CrashLoopBackOff, etc.) and returning an error immediately rather than continuing to poll. This would make test failures faster and provide clearer error messages.

Comment on lines +117 to +120

```go
AllowPrivilegeEscalation: to.Ptr(false),
Capabilities: &corev1.Capabilities{
	Drop: []corev1.Capability{"ALL"},
},
```

Copilot AI Feb 26, 2026

The SecurityContext configuration may prevent the device plugin from functioning correctly. The NVIDIA device plugin needs to interact with GPU devices and the kubelet device plugin socket, which typically requires either:

  1. Privileged mode (Privileged: true), OR
  2. Specific capabilities like SYS_ADMIN

The current configuration only drops all capabilities without granting the necessary ones. The upstream NVIDIA device plugin DaemonSet spec typically does NOT use AllowPrivilegeEscalation: false with dropped capabilities - it usually runs in privileged mode or with specific capabilities.

Without proper privileges, the device plugin won't be able to enumerate GPUs or create device plugin sockets. This could cause the test to fail when it tries to validate GPU resources are advertised. Verify this configuration works, or update it to match the upstream privileged configuration.

Suggested change

```diff
-AllowPrivilegeEscalation: to.Ptr(false),
-Capabilities: &corev1.Capabilities{
-	Drop: []corev1.Capability{"ALL"},
-},
+Privileged: to.Ptr(true),
```

```go
Containers: []corev1.Container{
	{
		Name:  "nvidia-device-plugin-ctr",
		Image: "mcr.microsoft.com/oss/v2/nvidia/k8s-device-plugin:v0.18.2",
```

Copilot AI Feb 26, 2026

The image tag v0.18.2 does not match the version format used in parts/common/components.json (0.18.2-ubuntu22.04u1). While MCR may support the simplified v0.18.2 tag, it's better to use the exact version that aligns with what's cached on VHDs and referenced in components.json for consistency. This ensures the test validates the same version that production nodes would use.

The components.json shows nvidia-device-plugin versions like "0.18.2-ubuntu22.04u1" for Ubuntu 22.04. Consider using the full version tag or documenting why the simplified tag is intentional.

Copilot generated this review using guidance from repository custom instructions.
```go
ds := nvidiaDevicePluginDaemonset(s.Runtime.VM.KubeName)
err := s.Runtime.Cluster.Kube.CreateDaemonset(ctx, ds)
require.NoError(s.T, err, "failed to create NVIDIA device plugin DaemonSet")
```

Copilot AI Feb 26, 2026
The DaemonSet is created but never explicitly cleaned up. While Kubernetes may garbage collect when the test cluster is torn down, it's better practice to add explicit cleanup to prevent resource leaks if the test fails before cluster cleanup. E2E tests should be self-contained and clean up resources they create.

Consider using s.T.Cleanup() or defer to delete the DaemonSet at the end of the validator function, similar to how other e2e tests handle resource cleanup.

Suggested change

```go
// Ensure the DaemonSet is cleaned up after the test finishes
s.T.Cleanup(func() {
	s.T.Logf("Cleaning up NVIDIA device plugin DaemonSet %s/%s...", ds.Namespace, ds.Name)
	if err := s.Runtime.Cluster.Kube.Typed.AppsV1().DaemonSets(ds.Namespace).Delete(
		context.Background(),
		ds.Name,
		metav1.DeleteOptions{},
	); err != nil {
		s.T.Logf("Failed to delete NVIDIA device plugin DaemonSet %s/%s: %v", ds.Namespace, ds.Name, err)
	}
})
```

Comment on lines +106 to +121

```go
Containers: []corev1.Container{
	{
		Name:  "nvidia-device-plugin-ctr",
		Image: "mcr.microsoft.com/oss/v2/nvidia/k8s-device-plugin:v0.18.2",
		Env: []corev1.EnvVar{
			{
				Name:  "FAIL_ON_INIT_ERROR",
				Value: "false",
			},
		},
		SecurityContext: &corev1.SecurityContext{
			AllowPrivilegeEscalation: to.Ptr(false),
			Capabilities: &corev1.Capabilities{
				Drop: []corev1.Capability{"ALL"},
			},
		},
```

Copilot AI Feb 26, 2026
The DaemonSet container spec is missing several configuration options present in the upstream NVIDIA device plugin deployment (https://github.com/NVIDIA/k8s-device-plugin/blob/main/deployments/static/nvidia-device-plugin.yml). Specifically:

  1. Missing the /dev/null device volume mount which the device plugin may need
  2. The upstream spec typically includes additional environment variables like NVIDIA_MIG_MONITOR_DEVICES, NVIDIA_DRIVER_CAPABILITIES, etc.
  3. The upstream spec sets Privileged: true in SecurityContext rather than just dropping capabilities

While the minimal spec may work for basic testing, using a configuration closer to the upstream deployment would make this test more valuable as it validates the real-world customer scenario. Consider either aligning more closely with the upstream spec or adding a comment explaining why this simplified configuration is sufficient for the test's purpose.

Comment on lines +51 to +57

```go
waitForNvidiaDevicePluginDaemonsetReady(ctx, s)

// Validate that GPU resources are advertised by the device plugin
ValidateNodeAdvertisesGPUResources(ctx, s, 1, "nvidia.com/gpu")

// Validate that GPU workloads can be scheduled
ValidateGPUWorkloadSchedulable(ctx, s, 1)
```

Copilot AI Mar 3, 2026
After waiting for the device plugin pod to be ready, consider adding validation to check the pod logs for successful GPU discovery and device plugin registration. This would catch issues where the pod starts but the device plugin fails to function correctly (e.g., driver compatibility issues, incorrect configuration).

The logs should contain messages about discovering GPUs and registering with kubelet's device plugin framework. This would make the test more comprehensive and catch edge cases where the pod runs but doesn't actually register GPU resources properly.

@surajssd (Member) left a comment

nvm

Add a new e2e test that validates GPU nodes work correctly when the
NVIDIA device plugin is deployed as a Kubernetes DaemonSet instead of
a systemd service. This tests the upstream deployment model commonly
used by customers who manage their own device plugin deployment.

The test:
- Provisions a GPU node with drivers but without systemd device plugin
- Deploys nvidia-device-plugin v0.18.2 as a DaemonSet from MCR
- Validates GPU resources are advertised and workloads can be scheduled
ganeshkumarashok and others added 4 commits March 3, 2026 10:41
Co-authored-by: Copilot <175728472+Copilot@users.noreply.github.com>
- Use unique DaemonSet name per node to avoid collisions in shared cluster
- Add cleanup to delete DaemonSet when test finishes
- Use Privileged mode matching upstream NVIDIA device plugin spec
- Use existing WaitUntilPodRunning helper instead of custom wait loop
- Add comments explaining image version choice
- Extract image version to constant for easier updates
- Add validation that systemd device plugin is not running
- Truncate DaemonSet name to 63 chars (K8s limit)
- Add timeout contexts to cleanup operations
- Delete existing DaemonSet before create for idempotency
Copilot AI review requested due to automatic review settings March 3, 2026 18:42
@ganeshkumarashok force-pushed the aganeshkumar/nvidia-device-plugin-daemonset-e2e branch from 54fe957 to 2eae10e on March 3, 2026 18:42
Copilot AI left a comment

Pull request overview

Copilot reviewed 1 out of 1 changed files in this pull request and generated 3 comments.

Comment on lines +85 to +86

```go
// Using "is-active" which returns non-zero if not active
result := execScriptOnVMForScenario(ctx, s, "systemctl is-active nvidia-device-plugin.service 2>/dev/null || echo 'not-running'")
```

Copilot AI Mar 3, 2026

systemctl is-active ... || echo 'not-running' can produce multi-line stdout (e.g., inactive plus not-running) because is-active prints a status string even when it exits non-zero. This makes the log/output ambiguous and can mask unexpected states; consider using systemctl is-active --quiet ... (and then echo a single status) or otherwise ensure only one deterministic status string is captured.

Suggested change

```diff
-// Using "is-active" which returns non-zero if not active
-result := execScriptOnVMForScenario(ctx, s, "systemctl is-active nvidia-device-plugin.service 2>/dev/null || echo 'not-running'")
+// Using "is-active --quiet" and echoing a single deterministic status string
+result := execScriptOnVMForScenario(ctx, s, "if systemctl is-active --quiet nvidia-device-plugin.service 2>/dev/null; then echo 'active'; else echo 'not-running'; fi")
```

Comment on lines +98 to +105

```go
func nvidiaDevicePluginDaemonsetName(nodeName string) string {
	prefix := "nvdp-" // Short prefix to leave room for node name
	maxLen := 63
	name := prefix + nodeName
	if len(name) > maxLen {
		name = name[:maxLen]
	}
	return name
```

Copilot AI Mar 3, 2026

nvidiaDevicePluginDaemonsetName truncates to 63 chars but doesn’t ensure the result is a valid DNS-1123 label (must end with an alphanumeric). If truncation lands on -, DaemonSet creation/label matching will fail. Consider trimming trailing - after truncation (similar to truncatePodName in validators.go) and, to avoid collisions across long node names, appending a short hash suffix derived from the full node name.

Comment on lines +203 to +213

```go
deleteCtx, deleteCancel := context.WithTimeout(ctx, 30*time.Second)
defer deleteCancel()
_ = s.Runtime.Cluster.Kube.Typed.AppsV1().DaemonSets(ds.Namespace).Delete(
	deleteCtx,
	ds.Name,
	metav1.DeleteOptions{},
)

// Create the DaemonSet
err := s.Runtime.Cluster.Kube.CreateDaemonset(ctx, ds)
require.NoError(s.T, err, "failed to create NVIDIA device plugin DaemonSet")
```

Copilot AI Mar 3, 2026

The pre-delete of the DaemonSet is asynchronous; immediately calling create with the same name can still fail with AlreadyExists while the old DaemonSet is terminating. To avoid flaky reruns after a failed/aborted test, consider waiting for the DaemonSet to be fully deleted (poll for NotFound) or switching to an idempotent CreateOrUpdate that actually updates the existing DaemonSet spec.

@ganeshkumarashok merged commit ca8a90e into main on Mar 3, 2026
21 of 29 checks passed
@ganeshkumarashok deleted the aganeshkumar/nvidia-device-plugin-daemonset-e2e branch on March 3, 2026 18:47